Throughout, I practiced troubleshooting AWS CloudFormation deployments through a series of tasks:
I created a proof of concept for the Café's deployment strategy using Infrastructure as Code (IaC).
I started by practicing JMESPath queries at jmespath.org to prepare for working with JSON-formatted data. This was really helpful since I knew I'd be using similar queries with the AWS CLI later.
First, I copied this JSON document to work with:
{
"desserts": [
{
"name": "Chocolate cake",
"price": "20.00"
},
{
"name": "Ice cream",
"price": "15.00"
},
{
"name": "Carrot cake",
"price": "22.00"
}
]
}
I tried several queries to get familiar with the syntax:
Then I practiced with AWS resource data:
{
"StackResources": [
{
"LogicalResourceId": "VPC",
"ResourceType": "AWS::EC2::VPC"
},
{
"LogicalResourceId": "PublicSubnet1",
"ResourceType": "AWS::EC2::Subnet"
},
{
"LogicalResourceId": "CliHostInstance",
"ResourceType": "AWS::EC2::Instance"
}
]
}
I figured out that to get the LogicalResourceId of the EC2 instance, I needed to use:
StackResources[?ResourceType == 'AWS::EC2::Instance'].LogicalResourceId
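The same expression can be handed to the AWS CLI through its --query parameter. The following is a minimal sketch rather than part of the original exercise: the function name logical_ids_by_type is hypothetical, and it assumes a configured AWS CLI and a stack that already exists.

```shell
# Hypothetical helper: print the logical IDs of every resource of one type in a stack.
# Assumes the AWS CLI is installed and configured with credentials and a region.
logical_ids_by_type() {
  stack_name=$1
  resource_type=$2
  aws cloudformation describe-stack-resources \
    --stack-name "$stack_name" \
    --query "StackResources[?ResourceType == '$resource_type'].LogicalResourceId" \
    --output text
}

# Example call (requires AWS access):
# logical_ids_by_type myStack AWS::EC2::Instance
```

Wrapping the query in double quotes lets the shell substitute the resource type while the single quotes required by JMESPath string literals stay intact.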
During this work, I was encouraged to practice using JMESPath expressions throughout, especially with the AWS CLI's --query parameter. This proved extremely valuable for filtering the output of AWS CLI commands on the client side. Parameters such as --filters, by contrast, are evaluated by the service and do not use JMESPath syntax.
After understanding JMESPath basics, I moved on to establishing an SSH connection to the CLI Host instance. Once connected, I determined which region I was working in:
curl http://169.254.169.254/latest/dynamic/instance-identity/document | grep region
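Because grep prints the whole matching line, a small filter can isolate just the region value. This is a sketch under the assumption that the identity document lists the region on a line of the form "region" : "us-west-2"; the function name region_from_identity_document is hypothetical.

```shell
# Read an instance identity document on stdin and print only the region value.
# Assumes the document contains a line such as:  "region" : "us-west-2",
region_from_identity_document() {
  sed -n 's/.*"region"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Example call on the instance (requires access to the metadata service):
# curl -s http://169.254.169.254/latest/dynamic/instance-identity/document | region_from_identity_document
```

The printed value is the region name to enter at the aws configure prompt.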
Then I configured the AWS CLI with my credentials:
aws configure
I entered my AccessKey, SecretKey, region name (matching where my EC2 instance was running), and set the output format to JSON.
The environment provided me with an Amazon EC2 instance named CLI Host that already existed in a VPC named VPC2. This setup allowed me to use the AWS CLI to run CloudFormation commands.
I examined the CloudFormation template that I'd be working with:
less template1.yaml
The template was quite comprehensive, containing:
I pressed RETURN (ENTER) to scroll through the contents of the file while using the `less` command.
I then tried to create my first stack:
aws cloudformation create-stack \
  --stack-name myStack \
  --template-body file://template1.yaml \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameters ParameterKey=KeyName,ParameterValue=vockey
To monitor the creation process, I ran:
watch -n 5 -d \
  aws cloudformation describe-stack-resources \
  --stack-name myStack \
  --query 'StackResources[*].[ResourceType,ResourceStatus]' \
  --output table
In this command, the watch Linux utility runs the same command every 5 seconds and briefly highlights changes as they occur. The --output table parameter makes reading the results easier.
After watching for a few minutes, I noticed something strange - resources were being created but then suddenly started being deleted. Clearly something was wrong! I pressed CTRL+C to exit the watch utility and ran:
watch -n 5 -d \
  aws cloudformation describe-stacks \
  --stack-name myStack \
  --output table
The stack status showed CREATE_FAILED and then went to ROLLBACK_IN_PROGRESS followed by ROLLBACK_COMPLETE. This confirmed that CloudFormation was automatically rolling back due to a failure.
This behavior was expected as part of the exercise.
To understand what went wrong, I checked for CREATE_FAILED events:
aws cloudformation describe-stack-events \
  --stack-name myStack \
  --query "StackEvents[?ResourceStatus == 'CREATE_FAILED']"
The output showed that the WaitCondition timed out. Since the WaitCondition was supposed to receive a signal from the userdata script on the EC2 instance, this suggested a problem with that script. Unfortunately, the automatic rollback had already deleted the EC2 instance, so I couldn't check its logs.
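The full event records are verbose, so it can help to project only the fields that explain a failure. A minimal sketch, assuming a configured AWS CLI; the function name failed_events is hypothetical, while Timestamp, LogicalResourceId, and ResourceStatusReason are fields of each stack event.

```shell
# Hypothetical helper: show when, where, and why resources failed to create.
failed_events() {
  aws cloudformation describe-stack-events \
    --stack-name "$1" \
    --query "StackEvents[?ResourceStatus == 'CREATE_FAILED'].[Timestamp,LogicalResourceId,ResourceStatusReason]" \
    --output table
}

# Example call (requires AWS access):
# failed_events myStack
```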
By default, AWS CloudFormation deletes all resources if any resource defined in the template cannot be successfully created. Because the wait condition resource failed, the entire stack failed and all changes were rolled back.
I verified the stack status was now ROLLBACK_COMPLETE:
aws cloudformation describe-stacks \
  --stack-name myStack \
  --output table
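Instead of rerunning describe-stacks by hand, the check can be scripted. The sketch below polls until the stack status no longer ends in _IN_PROGRESS and then prints it. Both function names are hypothetical, and the rule "anything ending in _IN_PROGRESS is still running" is a simplification made for this example.

```shell
# Succeed (exit 0) when a stack status is terminal.
# Simplification: every status ending in _IN_PROGRESS is treated as still running.
is_terminal_status() {
  case "$1" in
    *_IN_PROGRESS) return 1 ;;
    *) return 0 ;;
  esac
}

# Hypothetical helper: poll a stack until it settles, then print its final status.
# $1 = stack name, $2 = seconds between polls (defaults to 5).
wait_for_stack() {
  stack_name=$1
  interval=${2:-5}
  while :; do
    status=$(aws cloudformation describe-stacks \
      --stack-name "$stack_name" \
      --query 'Stacks[0].StackStatus' \
      --output text) || return 1
    if is_terminal_status "$status"; then
      echo "$status"
      return 0
    fi
    sleep "$interval"
  done
}

# Example call (requires AWS access):
# wait_for_stack myStack
```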
I deleted this failed stack:
aws cloudformation delete-stack --stack-name myStack
The stack was deleted quickly because the rollback had already removed its resources, so nothing was left to delete.
For my second attempt, I decided to prevent CloudFormation from rolling back on failure so I could investigate:
aws cloudformation create-stack \
  --stack-name myStack \
  --template-body file://template1.yaml \
  --capabilities CAPABILITY_NAMED_IAM \
  --on-failure DO_NOTHING \
  --parameters ParameterKey=KeyName,ParameterValue=vockey
I monitored the creation again:
watch -n 5 -d \
  aws cloudformation describe-stack-resources \
  --stack-name myStack \
  --query 'StackResources[*].[ResourceType,ResourceStatus]' \
  --output table
This time, when the WaitCondition failed, the other resources remained in CREATE_COMPLETE status instead of being deleted. Perfect!
To exit the watch utility, I pressed CTRL+C.
I confirmed the stack was in CREATE_FAILED status:
aws cloudformation describe-stacks \
  --stack-name myStack \
  --output table
I verified that the same WaitCondition timeout was the issue:
aws cloudformation describe-stack-events \
  --stack-name myStack \
  --query "StackEvents[?ResourceStatus == 'CREATE_FAILED']"
Now I could connect to the Web Server EC2 instance to see what went wrong. First, I needed its IP address:
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values='Web Server'" \
  --query 'Reservations[].Instances[].[State.Name,PublicIpAddress]'
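To reuse the address in later commands, the lookup can be narrowed to running instances and reduced to the bare IP. This is a sketch only: the function name web_server_ip is hypothetical, and the extra instance-state-name filter is an addition that was not part of the command I ran.

```shell
# Hypothetical helper: print the public IP of the running instance tagged "Web Server".
# Assumes the AWS CLI is configured and that the tag matches the running instance(s) of interest.
web_server_ip() {
  aws ec2 describe-instances \
    --filters "Name=tag:Name,Values='Web Server'" "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].PublicIpAddress' \
    --output text
}

# Example call (requires AWS access):
# ip=$(web_server_ip); echo "Web Server is at $ip"
```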
I opened a new terminal window and connected to the Web Server via SSH. Once connected, I examined the cloud-init-output.log:
tail -50 /var/log/cloud-init-output.log
I noticed two critical errors:
Next, I looked at the part-001 script to see what went wrong:
sudo cat /var/lib/cloud/instance/scripts/part-001
I noticed the script had a #! line with the -e parameter, which meant it would immediately stop if any command failed. The issue was clear - it was trying to install a package called "http" which didn't exist (it should have been "httpd" for the Apache web server).
In summary, because no package named http could be found, the userdata script failed. Therefore, the wait condition never received the success signal, and after 2 minutes, the wait condition timed out. This was why the stack failed.
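The effect of the -e parameter can be reproduced locally without any AWS resources. In this sketch, false stands in for the failing package installation, and the echo lines are placeholders for the real script's steps. It uses sh -e, which is an assumption made for portability and not necessarily the interpreter named in the original #! line.

```shell
# Run a stand-in for the userdata script with -e enabled.
# "false" plays the role of the failed package installation, so the shell exits
# before it reaches the line that would send the success signal.
run_userdata_sketch() {
  sh -e -c '
    echo "installing packages"
    false
    echo "sending success signal"
  '
}

# Example call:
# run_userdata_sketch; echo "exit status: $?"
```

Only the first message is printed and the exit status is non-zero, which mirrors why the wait condition never received its signal.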
I disconnected from the Web Server instance and closed that terminal window.
Back on the CLI Host, I edited the template to fix the issue:
vim template1.yaml
I navigated to line 128 and changed "http" to "httpd". I verified the fix with:
grep httpd template1.yaml
When using the vi editor, I used the UP and DOWN arrow keys to position my cursor, entered "a" to enter edit mode, made the change, pressed ESC to exit edit mode, and entered ":wq" followed by ENTER to write the change and quit.
I made sure to check that the file was updated correctly. If the yum line had not appeared, my changes might not have been saved.
I deleted the failed stack:
aws cloudformation delete-stack --stack-name myStack
I monitored the deletion with:
watch -n 5 -d \
  aws cloudformation describe-stacks \
  --stack-name myStack \
  --output table
Once the deletion was complete, I created a new stack with the corrected template:
aws cloudformation create-stack \
  --stack-name myStack \
  --template-body file://template1.yaml \
  --capabilities CAPABILITY_NAMED_IAM \
  --on-failure DO_NOTHING \
  --parameters ParameterKey=KeyName,ParameterValue=vockey
I monitored the creation:
watch -n 5 -d \
  aws cloudformation describe-stack-resources \
  --stack-name myStack \
  --query 'StackResources[*].[ResourceType,ResourceStatus]' \
  --output table
This time, all resources were created successfully! I confirmed with:
aws cloudformation describe-stacks \
  --stack-name myStack \
  --output table
The stack showed CREATE_COMPLETE, and I could see the PublicIP of the web server and the S3 bucket name in the Outputs section.
I tested the web server by opening the public IP in my browser and saw the "Hello from your web server!" message.
I figured out why the stack was failing, discovered the root cause by looking at log files on the EC2 instance, and updated the CloudFormation template so that it created the full set of resources. Combining the WaitCondition with the -e parameter in the userdata script means the stack only reports success if every command in the script ran without error.
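Next, I wanted to explore CloudFormation's drift detection, so I manually modified a resource: I changed the security group's inbound rule for port 22 so that it allowed only my IP address instead of 0.0.0.0/0.
I arranged the AWS Management Console tab so that it displayed alongside the instructions, which made the steps easier to follow.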
I also added an object to the S3 bucket created by CloudFormation:
bucketName=$(aws cloudformation describe-stacks \
  --stack-name myStack \
  --query "Stacks[*].Outputs[?OutputKey == 'BucketName'].[OutputValue]" \
  --output text)
echo "bucketName = $bucketName"
touch myfile
aws s3 cp myfile "s3://$bucketName/"
aws s3 ls "$bucketName/"
To detect these changes, I ran:
aws cloudformation detect-stack-drift --stack-name myStack
I monitored the drift detection status, passing the StackDriftDetectionId value returned by the previous command:
aws cloudformation describe-stack-drift-detection-status \
  --stack-drift-detection-id <StackDriftDetectionId>
The output showed "StackDriftStatus": "DRIFTED", indicating that at least one resource had changed.
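Starting the detection and checking its status can be chained so that the detection ID never has to be copied by hand. A minimal sketch, assuming a configured AWS CLI; the function name detect_drift is hypothetical, while StackDriftDetectionId, DetectionStatus, and StackDriftStatus are fields returned by the two commands.

```shell
# Hypothetical helper: start drift detection, wait for it to finish,
# and print the detection status and stack drift status (tab-separated).
# $1 = stack name, $2 = seconds between polls (defaults to 5).
detect_drift() {
  stack_name=$1
  interval=${2:-5}
  drift_id=$(aws cloudformation detect-stack-drift \
    --stack-name "$stack_name" \
    --query 'StackDriftDetectionId' \
    --output text) || return 1
  while :; do
    result=$(aws cloudformation describe-stack-drift-detection-status \
      --stack-drift-detection-id "$drift_id" \
      --query '[DetectionStatus,StackDriftStatus]' \
      --output text) || return 1
    case "$result" in
      DETECTION_IN_PROGRESS*) sleep "$interval" ;;
      *) echo "$result"; return 0 ;;
    esac
  done
}

# Example call (requires AWS access):
# detect_drift myStack
```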
I examined which resources had drifted:
aws cloudformation describe-stack-resources \
  --stack-name myStack \
  --query 'StackResources[*].[ResourceType,ResourceStatus,DriftInformation.StackResourceDriftStatus]' \
  --output table
As expected, the security group showed MODIFIED status, but interestingly, the S3 bucket still showed IN_SYNC. This is because adding files to a bucket doesn't register as drift in CloudFormation (only property changes do).
I looked at the specific details of the security group drift:
aws cloudformation describe-stack-resource-drifts \
  --stack-name myStack \
  --stack-resource-drift-status-filters MODIFIED
The PropertyDifferences section showed that port 22 was now open only to my IP address instead of 0.0.0.0/0.
I tried updating the stack to see if it would resolve the drift:
aws cloudformation update-stack \
  --stack-name myStack \
  --template-body file://template1.yaml \
  --parameters ParameterKey=KeyName,ParameterValue=vockey
As expected, this did not resolve the drift. The update-stack command does not automatically bring drifted resources back in line with the template, so manual intervention is required to eliminate drift.
Finally, I attempted to delete the stack:
aws cloudformation delete-stack --stack-name myStack
I monitored the deletion:
watch -n 5 -d \
  aws cloudformation describe-stack-resources \
  --stack-name myStack \
  --query 'StackResources[*].[ResourceType,ResourceStatus]' \
  --output table
Most resources were successfully deleted, but the S3 bucket showed DELETE_FAILED. I checked the stack status:
aws cloudformation describe-stacks \
  --stack-name myStack \
  --output table
The status showed "DELETE_FAILED" with the reason: "The following resource(s) failed to delete: [MyBucket]."
This made sense: to prevent accidental data loss, CloudFormation won't delete a bucket that still contains objects.
One approach would be to manually delete or move the file from the S3 bucket and then run the delete-stack command again. However, this might not be appropriate if people in the organization have already started storing files in the bucket and other systems depend on the bucket name and location not changing.
For the challenge of keeping the bucket and its contents while successfully deleting the stack, I first looked up the logical ID of the bucket:
aws cloudformation describe-stack-resources \
  --stack-name myStack \
  --query "StackResources[?ResourceType=='AWS::S3::Bucket'].LogicalResourceId" \
  --output text
This returned "MyBucket".
I then ran the delete again, this time telling CloudFormation to retain that resource:
aws cloudformation delete-stack \
  --stack-name myStack \
  --retain-resources MyBucket
I confirmed that the bucket and its object were still there:
aws s3 ls s3://$bucketName/
Finally, I checked the stack itself:
aws cloudformation describe-stacks \
  --stack-name myStack
This returned an error indicating the stack no longer existed.
This approach successfully kept the S3 bucket and its content while completely removing the CloudFormation stack - perfect for scenarios where you want to preserve certain resources while removing the stack infrastructure.
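The lookup and the retained delete can be combined into one step. The sketch below is an illustration under assumptions: the function name delete_stack_keeping_buckets is hypothetical, it expects a configured AWS CLI, and it relies on the stack already being in the DELETE_FAILED state, since --retain-resources applies only to stacks in that state.

```shell
# Hypothetical helper: delete a stack in DELETE_FAILED state while retaining its S3 buckets.
delete_stack_keeping_buckets() {
  stack_name=$1
  bucket_ids=$(aws cloudformation describe-stack-resources \
    --stack-name "$stack_name" \
    --query "StackResources[?ResourceType=='AWS::S3::Bucket'].LogicalResourceId" \
    --output text) || return 1
  # Stop if the stack has no buckets, because --retain-resources needs at least one ID.
  if [ -z "$bucket_ids" ]; then
    echo "no S3 buckets found in $stack_name" >&2
    return 1
  fi
  # $bucket_ids is left unquoted on purpose so each logical ID becomes its own argument.
  aws cloudformation delete-stack \
    --stack-name "$stack_name" \
    --retain-resources $bucket_ids
}

# Example call (requires AWS access):
# delete_stack_keeping_buckets myStack
```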
In this work, I created a deployment of a web server inside a custom VPC using an AWS CloudFormation template and learned how to troubleshoot deployments to understand the effectiveness of this technology.
I was excited to inform the business owners about the proof of concept work using CloudFormation. I explained how this "Infrastructure as Code" approach was causing me to completely rethink the café's current approach to managing the AWS resources that support the café website and other business applications.
They were very interested, and they discussed how they could use this technology to reliably create matching but separate development and production environments for new feature development. They also saw the value in features such as drift detection and the ability to tear down and rebuild complex cloud infrastructures consisting of resources from multiple AWS services.
This hands-on troubleshooting experience with AWS CloudFormation taught me valuable lessons about:
The work objectives were successfully completed:
I can see how this "Infrastructure as Code" approach would be incredibly useful for creating consistent cloud deployments, detecting unauthorized changes, and managing complex infrastructure reliably.